MapReduce: Distributed Computing for Machine Learning

نویسندگان

  • Dan Gillick
  • Arlo Faria
  • John DeNero
چکیده

We use Hadoop, an open-source implementation of Google’s distributed file system and the MapReduce framework for distributed data processing, on modestly-sized compute clusters to evaluate its efficacy for standard machine learning tasks. We show benchmark performance on searching and sorting tasks to investigate the effects of various system configurations. We also distinguish classes of machine-learning problems that are reasonable to address within the MapReduce framework, and offer improvements to the Hadoop implementation. We conclude that MapReduce is a good choice for basic operations on large datasets, although there are complications to be addressed for more complex machine learning tasks.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Grid Based System for Data Mining Using MapReduce

In this paper, we discuss a Grid data mining system based on the MapReduce paradigm of computing. The MapReduce paradigm emphasizes system automation of fault tolerance and redundancy, while keeping the programming model for the user very simple. MapReduce is built closely on top of a distributed file system, that allows efficient distributed storage of large data sets, and allows computation t...

متن کامل

Chapter: Distributed Platforms and Cloud Services Enabling Machine Learning for Big Data. An Overview1

Applying popular machine learning algorithms to large amounts of data raised new challenges for machine learning practitioners. Traditional libraries does not support properly the processing of huge data sets, so that new approaches are needed. Using modern distributed computing paradigms, such as MapReduce, or in-memory processing novel machine learning libraries have been devised. In parallel...

متن کامل

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce

Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google’s...

متن کامل

A MapReduce based distributed SVM algorithm for binary classification

Although Support Vector Machine (SVM) algorithm has a high generalization property to classify for unseen examples after training phase and it has small loss value, the algorithm is not suitable for real-life classification and regression problems. SVMs cannot solve hundreds of thousands examples in training dataset. In previous studies on distributed machine learning algorithms, SVM is trained...

متن کامل

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

Hadoop MapReduce framework is an important distributed processing model for large-scale data intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and a same workload is assigned to each node. Default Hadoop d...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006